A calibrated Bayesian method for the stratified proportional hazards model with missing covariates

Missing covariates are commonly encountered when evaluating covariate effects on survival outcomes. Excluding missing data from the analysis may lead to biased parameter estimation and a misleading conclusion. The inverse probability weighting method is widely used to handle missing covariates. However, obtaining asymptotic variance in frequentist inference is complicated because it involves estimating parameters for propensity scores. In this paper, we propose a new approach based on an approximate Bayesian method without using Taylor expansion to handle missing covariates for survival data. We consider a stratified proportional hazards model so that it can be used for the non-proportional hazards structure. Two cases for missing pattern are studied: a single missing pattern and multiple missing patterns. The proposed estimators are shown to be consistent and asymptotically normal, which matches the frequentist asymptotic properties. Simulation studies show that our proposed estimators are asymptotically unbiased and the credible region obtained from posterior distribution is close to the frequentist confidence interval. The algorithm is straightforward and computationally efficient. We apply the proposed method to a stem cell transplantation data set.


Introduction
The proportional hazards (PH) model [9] is widely used to evaluate covariate effects on survival outcome. In many biomedical studies, covariate information is often incompletely observed for some subjects due to loss of follow-up, loss of hospital records, or study design. For example, Dreger el al. [11] studied 1,394 patients who were aged 18 years or older, had relapsed diffuse large B-cell lymphoma (DLBCL), and had undergone their first non-myeloablative or reduced-intensity conditioning allogeneic stem cell transplantation between 2008 and 2015. They compared the effect of the following donor types on overall survival: haploidentical family donors using post-transplant cyclophosphamide (PTCy), matched sibling donors (MSD) / matched unrelated donors (MUD) with or without T-cell depletion. There were missing records in hematopoietic cell transplant-comorbidity index. In addition, the PH assumption was not valid for remission status at time of transplant.
One simple way, but a commonly used method, to handle such incomplete data is excluding missing data from the analysis, which is called the complete-case (CC) method. However, the CC method may lead to biased parameter estimation when the missing mechanism is related to outcome variables. A more sensible way to handle missing data is using the propensity score weighting, which is based on a model for the response probability. Once the response probabilities are estimated, the inverse of the estimated response probability is applied as the weight for estimating parameters of interest.
Extensive work for the PH model with missing covariates has been done under the frequentist framework. Lin and Ying [23] proposed a pseudo-likelihood score function to handle missing covariates under the missing-completely-at-random assumption. Zhou and Pepe [44] and Chen and Little [5] proposed an estimated partial-likelihood method using auxiliary covariates and a nonparametric maximum likelihood method, respectively. Pugh et al. [29] proposed an inverse-probability-weighted (IPW) estimating equation when the missing mechanism is missing at random (MAR) in the sense of Rubin [31]. Wang and Chen [39] and Xu et al. [41] considered an augmented inverse probability weighting (AIPW) scheme [30] under the MAR assumption. Herring and Ibrahim [14] and Herring et al. [15] introduced Monte-Carlo-Expectation-Maximization-algorithm-based methods to handle missing covariates under the MAR assumption and non-ignorable missing, respectively. Multiple imputation for the PH model has also been explored [24,40,3]. However, all of these methods are limited to the non-stratified PH model. Because it is common that the PH assumption is not satisfied for some covariates [11,38,27] or stratified sampling is used for data collection [18,19], it is crucial to study the stratified PH model for missing data. In addition, many of these existing methods rely on Taylor expansion to study the asymptotic properties of the estimators and thus obtaining variance estimates of the estimators can be complicated in practice.
Besides the frequentist approaches, several Bayesian approaches were also proposed to handle missing covariates for survival data. Chen et al. [6] proposed a general class of informative priors for semi-parametric survival models incorporating a cure fraction. Yoo and Lee [42] extended the Bayesian adaptive B-spline estimation method of Sharef et al. [34] to the clustered survival data with missing covariates. Hemming and Hutton [13] considered Markov Chain Monte Carlo (MCMC) to handle missing covariates for the accelerated failure time model under the MAR assumption. For the PH model, Ibrahim et al. [17] and Bradshaw et al. [4] studied variable selection and non-ignorably missing time-varying covariates, respectively, under the MAR assumption. Chen et al. [8] considered covariates with detection limit. In general, the current Bayesian literature for missing covariates for the PH model requires complicated MCMC algorithms that generate correlated samples and the burn-in period. Finding straightforward and less computationally intensive methods for survival outcomes with missing data is desirable in practice. Furthermore, like the frequentist approaches discussed previously, these Bayesian methods did not consider the stratified PH model. Sang and Kim [32] recently proposed an approximate Bayesian method to handle unit nonresponse for outcomes. Their Bayesian method is calibrated to the frequentist inference in that the credible region obtained from the posterior distribution asymptotically matches the frequentist confidence interval. Although their method is an approximate method based on the asymptotic normality of the score function, it is attractive in practice due to its simplicity and good performance in parameter estimation. More specifically, their algorithm generates independent samples for the posterior distribution of parameters and does not require MCMC for posterior computation.
Survival data often have multiple missing covariates. For example, Pidala et al. [26] studied the impact of DPB1 T-cell epitope matching on the hematopoietic cell transplantation outcomes for patients with leukemia and myelodysplastic syndrome. There were two important variables with missing values: DPB1 T-cell epitope matching and Karnofsky performance score (KPS). Existing literature considers one missing mechanism to handle these two missing covariates. However, the missing mechanism for DPB1 T-cell epitope matching could be different from one for KPS. In addition, cases having missing values in both variables could have a different missing mechanism from those with missing values in one of the two variables. This important aspect has been largely ignored in the current literature.
Motivated by Sang and Kim [32], we propose an approximate Bayesian approach for survival data with missing covariates. We consider the stratified PH model so that it can be used even for data with the non-proportional hazard structure. The proposed method uses two score equations, one from the stratified PH model and the other from the response model, to construct the likelihood part of the posterior distribution. With a flat prior, the credible interval from the posterior distribution asymptotically matches the confidence interval of the frequentist IPW approach. In contrast with the frequentist approach, the proposed method does not involve Taylor expansion. Sampling from the posterior distribution is straightforward to implement. We also propose new methods to take account of multiple covariates missing patterns. We study the asymptotic properties of the IPW estimators in frequentist inference for the stratified PH model with known and unknown missing mechanisms. Simulation studies and applications to the data of Dreger et al. [11] and Pidala et al. [26] are also provided.

Model definitions and assumptions
Assume the full cohort consists of n subjects and there are L strata, where L is fixed. Let T li be the failure time, C li be the potential censoring time, and Z li = (Z li1 , …, Z lip ) T be a p × 1 time-independent covariate vector for subject i in stratum l for l = 1, …, L, i = 1, …, n l , where n l is the number of subjects in stratum l. Let X li = min(T li , C li ) denote the observed time in the full cohort and Δ li = I(T li ≤ C li ) be an event indicator. The study period is [0, τ].
We consider the stratified PH model: for subject i in stratum l, the hazard function λ li ( ⋅ ) associated with Z li is where λ 0l (t) is a baseline hazard function for stratum l and β 0 is an unknown parameter vector. We assume T li is independent of C li given Z li [9,10]. Next, we introduce notation and assumptions for missing covariates.
where Z li c and Z li m are the complete covariate vector and the missing covariate vector for subject i in stratum l, respectively. Let ξ li be the observation indicator for subject i in stratum l: ξ li = 1 if Z li is fully observed and ξ li = 0 if some elements of Z li are missing. Assume (T li , C li , Z li , ξ li ) for i = 1, …, n l within stratum l are independently and identically distributed. In addition, (T li , C li , Z li , ξ li ) and (T l′i′ , C l′i′ , Z l′i′ , ξ l′i′ ) are assumed to be independent when l ≠ l′. The observed data for stratum l is (X li , Δ li , Z li c , ξ li ) for i = 1, …, n l .
Define W li = (X li , Δ li , Z li c ) for i = 1, …, n l and l = 1, …, L. We assume that Z li m is missing at random in that the probability of observing missing covariates is conditionally independent of Z li m given W li . Let N li (t) = I(X li ≤ t, Δ li = 1) be the counting process for the observed failure time, and Y li (t) = I(X li ≥ t) denote the at-risk indicator for subject i in stratum l, where I( ⋅ ) is an indicator function.

Inverse Probability Weighted Estimators
The inverse probability weighted (IPW) estimator based on Horvitz and Thompson [16] adjusts for missing covariates by using the inverse of the response probability as the weight [39,41]. We assume the ξ li within stratum l is independently generated from a Bernoulli distribution and allow a different intercept for each stratum: where ω li = (1, I(l = 2), …, I(l = L), W li T ) T . Let ϕ 0 be the true parameter vector of ϕ. To where S l (d) (β, t) = n −1 ∑ i = 1 n l ξ li π li −1 Y li (t)Z li ⊗ d e β T Z li for d = 0, 1 and a ⊗ 0 = 1, a ⊗ 1 = a, a ⊗ 2 = aa T . The two functions U 1, n (β, ϕ) and U 2, n (ϕ) are the score functions from the stratified PH model and the logistic regression, respectively. For the frequentist IPW approach under the MAR assumption, one first estimates response probabilities π li 's by solving (4) and then plugs the estimated π li 's into (3) to estimate β. Let the solution to (3) be β. In practice, one relies on Taylor expansion to obtain the asymptotic normality of β [41].
To avoid Taylor expansion, instead of generating from p(θ|θ), we alternatively consider p(θ | U n ) ∝ p(U n (θ) | θ)p(θ), (5) where p(U n (θ)|θ) is the sampling distribution of U n (θ). To generate samples from (5), we consider a one-to-one transformation T : θ η such that η = E(U n |θ). Then, we generate η * from p(η|U n ) and obtain θ * = T −1 (η * ) as samples for the posterior distribution in (5). As in the next section, under some regularity conditions, the asymptotic distribution of U n is where d is convergence in distribution and Σ is the asymptotic covariance matrix of the joint score functions. Since the transformation T : θ η is one-to-one, (6) is equivalent to n{U n (T −1 η) − η} | η d N(0, Σ) .
Then, the posterior distribution of η given U n is p(η | U n ) ∝ p(U n | η)p(η), (8) where p(U n |η) is the density of the limiting distribution in (7). Equation (8) shows an important relationship between the frequentist IPW approach and the Bayesian approach. Under a flat prior for p(η) and the sufficient conditions for (7), we can approximate the posterior distribution p(η|U n ) as follows: p(η | U n ) ∼ N(U n , Σ/n) . (9) As shown in the Appendix, the estimator of Σ, Σ, can be obtained by

Asymptotic properties
We now study the asymptotic properties of p(θ|θ) in this section. To establish the consistency and the asymptotic normality of the IPW estimator for the stratified PH model, we assume the following conditions: C2 |Z lik (0)| + ∫ 0 τ |dZ lik (t)| < D z < ∞, l = 1, …, L, i = 1, …, n l and k = 1, …, p almost surely where D z is a constant; C3 For d = 0, 1, 2, there exists a neighborhood ℬ of β 0 such that s l (d) (β, t) is continuous and C5 The matrix V l ϕ is positive definite and π li ≥ ϵ > 0 for i = 1, …, n l and l = 1, …, L, where C8 lim n ∞ n l /n = q l , where q l ∈ (0, 1) for all l = 1, …, L; C9 As n ∞, sup θ ∈ Θ U n (θ) − η(θ) p 0, where Θ is the parameter space; C10 The map θ U n (θ) is continuous and has exactly one zero θ with probability one; C11 Equation η(θ) = 0 has exactly one root at θ = θ 0 ; C12 There exists a neighborhood of θ 0 , denoted by J n (θ 0 ), on which with probability one all U n (θ) are continuously differentiable and the Jacobian ∂U n (θ)/ ∂θ converges uniformly to a non-stochastic limit which is non-singular. Here, J n (θ 0 ) is a ball with center θ 0 and radius r n satisfies r n ∞ and r n n ∞; C13 For any θ ∈ J n (θ 0 ), given θ: holds for some Σ(θ) = V ar{ nU n (θ)|θ} that is positive definite and independent of n.
Conditions C1-C8 are the standard conditions for the consistency and asymptotic normality of β [1,41]. Conditions C9-C13 are needed for the asymptotic properties of the joint IPW estimator. More specifically, as long as the samples satisfy some moment conditions, condition C9 holds. Conditions C10 and C11 ensure the existence and uniqueness of the solutions to U n = 0. Condition C12 regulates the derivatives of U n and ensures its covariance converges. Condition C13 provides the asymptotic distribution for the estimating equation.
The proof of C13 can be found in Theorem 6 of Yuan and Jennrich (1998) [43], where Yuan and Jennrich (1998) [43] studied the large sample properties including the existence, strong consistency, and asymptotic normality of the estimators generated from samples that are not necessarily identically distributed under very general assumptions. Under Conditions C1-C13 we can show θ is a consistent estimator for θ and asymptotically normally distributed Following Sang and Kim [32], we assume the following conditions to establish the posterior consistency and asymptotic normality.
Condition C14 is a standard assumption for the prior and the flat prior satisfies this condition. Condition C15 implies the covariance estimator should be consistent. Conditions C16 to C17 are the sufficient conditions for the posterior distribution to be approximated by the proposed method. Soubeyrand and Haon-Lasportes [35] also used similar conditions to C14 and C16 to justify their approximate Bayesian computation methods. All the conditions can be easily satisfied if we assume covariance estimator is Lipschitz continuous in θ and has bounded eigenvalues as discussed in Sang and Kim [32].
Similarly to Xu et al. [41], we can establish the following asymptotic property for β under the stratified PH model: Theorem 1 Assume Conditions C1-C8 in Section 3.
2. If π li is known, n(β − β 0 ) is asymptotically normally distributed with mean 0 and Its proof is a straightforward extension of Theorem 2 of Xu et al. [41] to the stratified PH model and thus omitted. Using (13), one can develop a plug-in variance estimator of β , but it can involve a quite computation.
Next we have the following asymptotic property for estimators from the stratified PH model with two missing covariates having multiple missing patterns as in Section 2.4.
Its proof is similar to the proof of Theorem 4.1 of Sang and Kim [32] and thus is omitted. Results (14) and (15) show the convergence of the posterior distribution to a normal distribution and the posterior consistency, respectively. In particular, (14) implies the confidence region from the proposed Bayesian method is asymptotically equivalent to the frequentist confidence region based on asymptotic normality of θ. Thus, our proposed Bayesian method is calibrated to frequentist inference.
The proposed Bayesian estimators for ϕ and β can be obtained by the medians of the draws from the approximated posterior distribution. Because the posterior distribution is approximately normal by Theorem 3, one can construct the confidence region using the equal-tailed credible interval (ETI) or the level-α Bayesian High Posterior Density credible region using j * defined as C * (α) = {θ: P (θ|θ) ≥ j * (α)} [7].

Simulation
We conducted two simulation studies to investigate the finite sample properties of the approximate Bayesian method and the IPW method for stratified data. We compared them with the CC method.
In the first simulation, we considered a stratified PH model with two strata, i.e., L = 2. Two covariates were generated for each stratum: Z 11 from the Bernoulli distribution with probability 0.4 and Z 12 from the standard normal distribution for stratum 1; Z 21 from the Bernoulli distribution with probability 0.6 and Z 22 from the normal distribution with mean 1 and standard deviation 0.7 for stratum 2. Event times were generated based on the stratified PH model (1). We considered λ 10 (t) = 4t 3 for stratum 1 and λ 20 (t) = 2/3t −2/3 for stratum 2. We set β = (β 1 , β 2 ) T = {log (2), log(2)} T .
Four sample sizes were considered: n = 500, 1000, and 2000. Table 1 summarizes the simulation results based on B = 1000 Monte Carlo samples. For the proposed method, we obtained 1000 posterior medians and calculated the bias of the average of the 1000 medians, their standard deviations (SD), and the average percentage that 95% ETIs include the true parameters (CR E ).s For IPW, CC methods, the bias of the average of mean, their average of standard errors (SE), and 95% coverage rates (CR) were calculated. As seen in Table 1, the average of the posterior medians from the approximate Bayesian method and the average of the IPW estimators were close to the true values. All standard deviations of the approximate Bayesian method are close to the average of standard errors of the IPW method. The range of the average percentages that 95% ETIs include the true parameters and the coverage rates of the IPW method are between 93% and 96%. These results are consistent with Theorem 3. Standard deviations and the average of standard errors are larger as event rate is lower or sample size is smaller. In contrast, the CC method has biases and low coverage rates ranged from 71% to 90% for completely observed covariate Z c , which are well below 95%.
Furthermore, this phenomenon becomes more severe as the sample size increases.
We also conducted simulations for correlated covariates, different β's, and missing rates.  Table 1. When magnitude of β is larger and missing rate is higher, the CC method performed worse.

Real data application
We applied the proposed approximate Bayesian and IPW methods to the following two registry data sets: 1) the stem cell transplantation (HCT) data which Dreger et al. [11] analyzed to study patients with DLBCL; 2) the HCT data which Pidala et al. [26] investigated to study patients with myelodysplastic syndrome (MDS). The DLBCL data and the MDS data are for the stratified PH model with a single missing covariate and with two missing covariates, respectively. Thus, we applied the methods of Section 2.2 and 2.3 to the DLBCL data, and we used the methods of Section 2.4 for the MDS data.

Stratified PH model with a single missing covariate
The DLBCL data [11]  , Papanicolaou et al. [25], and Ustun et al. [37]). Thus, we adjusted these five covariates in the model. Four hundred eighty four patients (35%) have missing values in HCT-CI. We tested the PH assumption for each covariate by testing whether the coefficient of log t × Z is equal to zero for each variable [20]. Remission status at time of transplant did not satisfy the PH assumption at a significant level 0.05 (p-value = 0.0047). Thus, we stratified the PH model according to remission status.
We fitted the logistic regression to obtain the propensity score by allowing a different intercept for each stratum. Six variables including the stratum variable, year of transplant, age group, donor type, death indicator, and time to death were statistically significant at a significance level 0.05. We used these six variables to calculate propensity scores. We fitted the approximate Bayesian method, IPW method, and the CC method. Table 3 reports the analysis result including i) the posterior median β m and 95% ETI (ETI) for the approximate Bayesian method; ii) β and its 95% confidence intervals for the IPW method and the CC method. As expected, the results from the approximate Bayesian method and the IPW method were similar. However, the 95% confidence intervals of the IPW method are slightly wider than the 95% ETIs of the approximate Bayesian method in general. On the other hand, the effects of donor group, HCT-CI score, and Karnofsky score from the approximate Bayesian method and the IPW method were similar to those from the CC method. However, the effects of year of transplant were different: based on the approximate Bayesian method and the IPW method, patients who had HCT from 2008 to 2010 were more likely to die after HCT than those who had HCT from 2013 to 2015. The results from the approximate Bayesian/IPW methods and CC methods are different in donor group and year of transplant. In particular, although year of transplant did not reach statistical significance in the model from CC methods, the parameter estimates for 2008-2012 were negative. Thus, the results from CC methods imply patients who got transplant in recent years (2013-2015) experienced worse survival than earlier years (2008-2012). It is common in DLBCL studies that the progress of patients who got HCT in recent years was better than those who got HCT in earlier years [22,2,33]. Thus, the results on year of transplant from the CC method are counter-intuitive. In contrast with these, the year of transplant effects from the approximate Bayesian method and the IPW method are consistent with the current medical literature. The effects of year of transplant and HCT-CI contradict the current HCT literature.
We conducted a sensitive analysis by examining various propensity score models to investigate the missing-not-at-random assumption. We observed that the progress of patients who got HCT in recent years was worse than those who got HCT in earlier years which is inconsistent with the current HCT literature. Thus, the MAR assumption appears to be reasonable for this data set.

Stratified PH model with two missing covariates
The MDS data [26] for the analysis consisted of 787 adults or children with diagnoses of MDS, who underwent first myeloablative-unrelated bone marrow or peripheral blood stem cell transplantation conducted between 1999 and 2011 patients. An outcomes of interest in this analysis was an overall survival.  (1999-2002, 2003-2006, 2007-2011), and KPS (<90, ≥90). We tested the PH assumption for each covariate by testing whether the coefficient of log t × Z is equal to zero for each variable [20].
Graft type and year of transplant were not satisfied the PH assumption at a significant level 0.05. Thus, we fitted stratified PH model. We used the approximate Bayesian method/IPW method for multiple missing patterns of Section 2.4 and compared with the approximate Bayesian method/IPW method for a single missing pattern of Section 2.2 and 2.3, and the CC method. While none of covariates were significant for two missing categories for 1) only KPS missing; 2) both HLAD and KPS in the propensity score model at the significance level 0.05, three variables including year of transplant, death indicator, and time to death were significant in the propensity score model for only HLAD missing. Thus, applying the method for multiple missing patterns appeared to be more appropriate than using that for a single missing pattern. Table 4 reports the analysis results including i) the posterior median β m and its 95% ETI (ETI) for the approximate Bayesian method with multiple missing patterns and a single missing pattern; ii) β and its 95% confidence intervals for the IPW method, and the CC method. Results of approximate Bayesian/IPW methods with multiple missing patterns and a single missing pattern for KPS and race were similar. However, HLAD and age group show difference in results between the approaches for multiple missing patterns and those for a single missing pattern. While the effect of permissive classification group compared with fully matched classification group is positive from the approximate Bayesian/IPW methods for multiple missing patterns, it is negative from the approximate Bayesian/IPW methods for single missing. The results from the approximate Bayesian/IPW methods for multiple missing pattern show that the 95% CI and ETI of age 20-49 group did not contain 0, but the 95% CI and ETI of age 20-49 group contained 0 when considering a single missing pattern. The results for the CC method show the 95% CI and ETI of race other group did not contain 0 while those from the approximate Bayesian/IPW methods for multiple missing patterns contained 0.

Concluding Remarks
We have proposed new approximate Bayesian and IPW methods for the stratified PH model with incomplete covariate information. In particular, we studied multiple missing patterns, which is largely ignored in the current literature. Using the flat prior, the proposed Bayesian method is asymptotically equivalent to the frequentist IPW inference using Taylor linearization. The proposed Bayesian method can improve its performance if the prior is informative. In this case, it may be more efficient than the frequentist IPW method. The approximate Bayesian method can be further improved by adding an augmented term to the score function for the stratified PH model similarly to Wang and Chen [39] and Xu et al. [41].
The scheme of the proposed methods can also be applied to competing risks data. Exploring the cause-specific hazards model [28] and the proportional subdistribution hazards model [12] will be an interesting research topic. We only studied missing covariates in this article.
In HCT studies, it is common that a portion of either time to event or outcome indicators are also missing. Handling missingness in such outcomes would be an important research problem. For causal inference, the IPW approach is widely used to adjust for the probability of treatment assignments in observational studies. Applying the proposed Bayesian method to causal inference would be a worthy future topic. The proposed methods require specifying the propensity score model correctly. One can consider nonparametric regression for the propensity score estimation or a doubly robust estimator [30] for robust estimation. Pursuing this direction would be an important research topic in the future.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
We derive Σ and its estimator in the Appendix. Let dM li (t) = dN li (t) − Y li (t)exp{β T Z li }dΛ l (t).
Kim et al.      Table 4 Stem cell transplantation data analysis for multiple missing patterns